\section{Related Work}
\label{sec:fitb:related work}
The \textsc{nlp} and knowledge base related work is presented in Chapter~\ref{chap:context}, and the relation extraction related work is presented in Chapter~\ref{chap:relation extraction}.
The main approaches we built upon are:
\begin{itemize}
	\item Distant supervision (Section~\ref{sec:relation extraction:distant supervision}, \cite{distant}): the method we use to obtain a supervised dataset for evaluation;%
	\sidenote{As explained in Section~\ref{sec:relation extraction:clustering}, this is sadly standard in the evaluation of clustering approaches.}
	\item \textsc{pcnn} (Section~\ref{sec:relation extraction:pcnn}, \cite{pcnn}): our relation classifier, which was the state-of-the-art supervised relation extraction method at the time;
	\item Rel-\textsc{lda} (Section~\ref{sec:relation extraction:rellda}, \cite{rellda}): the state-of-the-art generative model we compare to;
	\item \textsc{vae} for relation extraction (Section~\ref{sec:relation extraction:vae}, \cite{vae_re}): the overall inspiration for the architecture of our model, with which we share the entity predictor;
	\item Self\textsc{ore} (Section~\ref{sec:relation extraction:selfore}, \cite{selfore}): an extension of our work which, alongside its own approach, proposed to improve our relation classifier by replacing the \textsc{pcnn} with a \bertcoder{}.
\end{itemize}
In this section, we give further details about the relationship between our losses and the ones derived by \textcite{vae_re}.
As a reminder, their model is a \textsc{vae} defined from an encoder \(Q(r\mid \vctr{e}, s; \vctr{\phi})\) and a decoder \(P(\vctr{e}\mid r, s; \vctr{\theta})\) as:
\begin{marginparagraph}
	The prior of a conditional \textsc{vae} \(P(r\mid\vctr{\theta})\) is usually conditioned on \(s\) too.
	However, this additional conditioning is not used by \textcite{vae_re}.
\end{marginparagraph}
\begin{equation}
	\loss{vae}(\vctr{\theta}, \vctr{\phi}) = \expectation_{Q(r\mid \vctr{e}, s; \vctr{\phi})}[ - \log P(\vctr{e}\mid r, s; \vctr{\theta})] + \beta \kl(Q(r\mid \vctr{e}, s; \vctr{\phi}) \mathrel{\|} P(r\mid\vctr{\theta}))
	\label{eq:fitb:vae full loss}
\end{equation}
This is simply a rewriting of the \textsc{elbo} of Equation~\ref{eq:relation extraction:elbo}, substituting the relation extraction variables for the generic ones.
There are, however, two differences compared to a standard \textsc{vae}.
First, the variable \(s\) is not reconstructed; it simply conditions the whole process.
Second, the regularization term is weighted by a hyperparameter \(\beta\).
This makes the model of \textcite{vae_re} a conditional \(\beta\)\textsc{-vae} \parencitex{conditional_vae, beta_vae}[-11mm].
The first summand of Equation~\ref{eq:fitb:vae full loss} is called the reconstruction loss since it reconstructs the input variable \(\vctr{e}\) from the latent variable \(r\) and the conditional variable \(s\).
Since we followed the same structure for our model, this reconstruction loss is actually \loss{ep}, the difference lying in the relation classifier.
We can then rewrite the loss of \textcite{vae_re} as:
\begin{marginparagraph}
	As explained in Section~\ref{sec:relation extraction:vae}, \(Q\) is the \textsc{vae}'s encoder.
\end{marginparagraph}
\begin{align*}
	\loss{vae}(\vctr{\theta}, \vctr{\phi}) & = \loss{ep}(\vctr{\theta}, \vctr{\phi}) + \beta \loss{vae reg}(\vctr{\theta}, \vctr{\phi}) \\
	\loss{vae reg}(\vctr{\theta}, \vctr{\phi}) & = \kl(Q(\rndm{r}\mid \rndmvctr{e}, \rndm{s}; \vctr{\phi}) \mathrel{\|} P(\rndm{r}\mid\vctr{\theta}))
\end{align*}
In their work, they select the prior as a uniform distribution over all relations \(P(\rndm{r}\mid\vctr{\theta}) = \uniformDistribution(\relationSet)\) and approximate \loss{vae reg} as follows:
\begin{equation*}
	\loss{vae reg}(\vctr{\phi}) = \expectation_{(\rndm{s}, \rndmvctr{e})\sim \uniformDistribution(\dataSet)} \left[ - \entropy(\rndm{R} \mid \rndm{s}, \rndmvctr{e}; \vctr{\phi}) \right]
\end{equation*}
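The link between this regularizer and the entropy can be made explicit: since the uniform prior assigns the probability \(1/|\relationSet|\) to every relation, for a given sample \((s, \vctr{e})\) the Kullback--Leibler divergence expands as:
\begin{align*}
	\kl(Q(\rndm{r}\mid \rndmvctr{e}, \rndm{s}; \vctr{\phi}) \mathrel{\|} \uniformDistribution(\relationSet)) & = \sum_{r\in\relationSet} Q(r\mid \vctr{e}, s; \vctr{\phi}) \log\left( |\relationSet| \, Q(r\mid \vctr{e}, s; \vctr{\phi}) \right) \\
	& = \log |\relationSet| - \entropy(\rndm{R} \mid \rndm{s}, \rndmvctr{e}; \vctr{\phi})
\end{align*}
Averaging over the dataset and dropping the constant \(\log|\relationSet|\), which does not depend on \(\vctr{\phi}\), yields the expression above.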
The purpose of \loss{vae reg} is to prevent the classifier from always predicting the same relation; in other words, it plays the same role as our distance loss \loss{d}.
However, its expression is equivalent to \(-\loss{s}\), and indeed, minimizing the opposite of our skewness loss increases the entropy of the classifier output, addressing \problem{2} (classifier always outputting the same relation).
Yet, using \(\loss{vae reg}=-\loss{s}\) alone draws the classifier into the other pitfall, \problem{1} (not predicting any relation confidently).
In a traditional \textsc{vae}, \problem{1} is addressed by the reconstruction loss \loss{ep}.
However, at the beginning of training, the supervision signal is so weak that we cannot rely on \loss{ep} for our task.
The \(\beta\) weighting can be decreased to avoid \problem{1}, but this would also lessen the solution to \problem{2}.
This causes a drop in performance, as we show experimentally.
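The trade-off can be summarized by the two limiting regimes of the weighted loss.
When \(\beta\) is large, \loss{vae reg} dominates and is minimized by the maximum-entropy posterior:
\begin{equation*}
	Q(r\mid \vctr{e}, s; \vctr{\phi}) = \frac{1}{|\relationSet|} \qquad \text{for all } r\in\relationSet
\end{equation*}
in which case no relation is predicted confidently, which is precisely \problem{1}.
Conversely, when \(\beta\) approaches zero, nothing penalizes a posterior concentrated on a single relation, and since \loss{ep} provides only a weak signal at the beginning of training, the classifier may fall back into \problem{2}.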